\(\newcommand{\rel}{{\rm I\hspace{-0.7mm}R}}\) \(\newcommand{\bm}[1]{\boldsymbol{#1}}\) \(\newcommand{\bms}[1]{\boldsymbol{\scriptsize #1}}\) \(\newcommand{\proper}[1]{\text{#1}}\) \(\newcommand{\pE}{\proper{E}}\) \(\newcommand{\pV}{\proper{Var}}\) \(\newcommand{\pCov}{\proper{Cov}}\) \(\newcommand{\pACF}{\proper{ACF}}\) \(\newcommand{\I}{\bm{\mathcal{I}}}\) \(\newcommand{\wh}[1]{\widehat{#1}}\) \(\newcommand{\wt}[1]{\widetilde{#1}}\) \(\newcommand{\pP}{\proper{P}}\) \(\newcommand{\pAIC}{\textsf{AIC}}\) \(\DeclareMathOperator{\diag}{diag}\)

Coursework solutions

Author

MA40198: Applied Statistical Inference

Preliminaries

Consider the following parametric family of probability density functions:

\[ \mathcal F_1 = \left\{ f(y|\bm{\lambda})=\exp(\lambda_1 \,y+\lambda_2 \log(\cos(\lambda_1)))\,g(y|\lambda_2) \,:\, -\frac{\pi}{2}<\lambda_1<\frac{\pi}{2}, \lambda_2>0 \right\} \]

The function \(g(y|\lambda_2)\) is itself a probability density function given by: \[ g(y|\lambda_2)=\frac{2^{\lambda_2-2}\Gamma^2\left(\frac{\lambda_2}{2}\right)}{\pi\Gamma(\lambda_2)}\prod_{j=0}^\infty \left\{1+\left(\frac{y}{\lambda_2+2j}\right)^2\right\}^{-1}\,,\quad y \in \rel \]

Here

\[ \Gamma(x)=\int_{0}^\infty t^{x-1}\exp(-t)dt\,,\quad x>0 \]

is the standard Gamma function.

For any computations involving \(g(y|\lambda_2)\) you should use the following approximation:

\[ g_N(y|\lambda_2)=\frac{2^{\lambda_2-2}\Gamma^2\left(\frac{\lambda_2}{2}\right)}{\pi\Gamma(\lambda_2)}\prod_{j=0}^N \left\{1+\left(\frac{y}{\lambda_2+2j}\right)^2\right\}^{-1}\,,\quad y \in \rel \]

Unless otherwise stated, you should use \(N=10,000\).
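As a quick sanity check (a sketch, not required by the coursework), the truncated density \(g_N\) should integrate to approximately 1; here we verify this for the illustrative choices \(N=1000\) and \(\lambda_2=1\):

```r
# Sanity check: the truncated density g_N should integrate to roughly 1.
# N = 1000 and lambda2 = 1 are illustrative choices, not coursework values.
g_N <- function(y, lambda2, N = 1000) {
  const <- 2^(lambda2 - 2) * gamma(lambda2 / 2)^2 / (pi * gamma(lambda2))
  sapply(y, function(yi)
    const * prod(1 / (1 + (yi / (lambda2 + 2 * (0:N)))^2)))
}

integrate(g_N, -Inf, Inf, lambda2 = 1)$value  # close to 1
```

The truncated product is slightly larger than the infinite one (each dropped factor is below 1), so the integral exceeds 1 by a small amount that shrinks as \(N\) grows.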

Questions

Question 1

Consider the following observed sample:

Code
y_sample_q1 <- scan("http://people.bath.ac.uk/kai21/ASI/CW_2023/y_sample_q1.txt")

Plot 40 contours of the negative loglikelihood function of the parameter \(\bm{\lambda}\) over the region defined by \(-\pi/2<\lambda_1<\pi/2\) and \(0<\lambda_2<50\). The contours should be sufficiently smooth and cover the entire region. You should indicate a smaller region delimited by a contour that contains the global minimum.

Solution to Question 1

The family \(\mathcal F_1\) is called the Generalised Hyperbolic Secant (GHS) family. The log of its densities is given by:

\[ \log f(y|\lambda_1,\lambda_2)=\lambda_1\,y+\lambda_2\,\log(\cos(\lambda_1))+(\lambda_2-2)\log 2+2\log \Gamma\left(\frac{\lambda_2}{2}\right)-\log \pi -\log \Gamma(\lambda_2) -\sum_{j=0}^N \log \left(1+\frac{y^2}{(\lambda_2+2j)^2}\right) \]

We can conveniently define \(\log f(y|\lambda_1,\lambda_2)=-\infty\) if either \(|\lambda_1|\geq \pi/2\) or \(\lambda_2\leq 0\).

The following code computes the density via its logarithm, which is more numerically stable:

Code
#| code-fold: show

# density of the generalised hyperbolic secant distribution
dghs <- function(y,
                 lambda1 = 0,     # lambda1 is bounded between +- pi/2
                 lambda2 = 1,     # lambda2 is positive
                 log     = FALSE,
                 N       = 1000){

# computes log density by default as is more numerically stable
  
if ((abs(lambda1)>pi/2)|(lambda2<=0)){
  res <- -Inf
}else{  
  aux <- function(y,lambda1,lambda2,N){
      (lambda2-2)*log(2)+
        2*lgamma(lambda2/2)-
          log(pi)-
            lgamma(lambda2)-
              sum(log(1+((y/(lambda2+2*(0:N)))^2)))+
                lambda1*y+lambda2*log(cos(lambda1))
}

# Use sapply to vectorise the log density wrt to the sample vector y
res <- sapply(y,
              FUN     = aux,
              lambda1 = lambda1,
              lambda2 = lambda2,
              N       = N)

}

if (!log){
  res <- exp(res) # returns the density if needed 
}

res

}

Next we code the negative loglikelihood function \(\phi(\bm{\lambda}|\bm{y})\) in a format suitable for plotting the contours.

Code
nll_plot <-  function(lambda1,
                      lambda2,
                      y,
                      N=1000){
  
  
  sum_log_dens<-function(y,lambda1,lambda2,N){
    sum(dghs(y,lambda1,lambda2,log = TRUE))
  }
  
  -mapply(FUN      = sum_log_dens,
          lambda1  = lambda1,
          lambda2  = lambda2,
          MoreArgs = list(y = y,
                          N = N))
  
}

We fix \(N=10,000\) for all further calculations.

Code
N_fix <- 10000

Now we plot the contours of the negative loglikelihood \(\phi(\bm{\lambda}|\bm{y})\):

Code
N_grid      <- 100


lambda1_grid <- seq(from   = -1.57,
                   to     = 1.57,
                   length = N_grid)

theta2_grid <- seq(from   = 0.1,
                   to     = 50,
                   length = N_grid)




nll_grid <- outer(X   = lambda1_grid,
                  Y   = theta2_grid,
                  FUN = nll_plot ,
                  y   = y_sample_q1,
                  N   = N_fix)



levels <-quantile(x    = nll_grid,
                 probs = seq(from   = (0.01),
                             to     = (0.99),
                             length = 40))




contour(x      = lambda1_grid,
        y      = theta2_grid,
        z      = nll_grid,
        levels = levels,
        xlab   = expression(lambda[1]),
        ylab   = expression(lambda[2]))

contour(x      = lambda1_grid,
        y      = theta2_grid,
        z      = nll_grid,
        levels =levels[1],
        col="red",
        add=T)

Contours of the negative loglikelihood (the red contour contains the global minimum)

We can see that the region inside the red contour contains the global minimum, and the contour levels increase smoothly as we move away from it. The negative loglikelihood increases towards infinity as we approach the boundaries \(\lambda_2=0\) and \(\lambda_1 =\pm \pi/2\). The limit as \(\lambda_2\to \infty\) appears regular, in the sense that all the contour lines cross the horizontal line \(\lambda_2=k\) for large enough \(k\); this indicates that, in the limit, the negative loglikelihood as a function of \(\lambda_1\) alone is well defined and has a minimum.

Question 2

Find the maximum likelihood estimate \(\widehat{\lambda}=(\hat{\lambda}_1,\hat{\lambda}_2)^T\) by picking the best out of 100 optimisations (using the BFGS algorithm) where each optimisation uses a different initial value. The following data frame gives the list of initial values to be used.

Code
 L0 <-read.table("http://people.bath.ac.uk/kai21/ASI/CW_2023/starting_vals_q2.txt")

Solution to Question 2

In order to perform unconstrained optimisation, we will reparametrise using the following transformation

\[ \bm{\lambda} = \bm{g}(\bm{\theta}) = \begin{pmatrix} \arctan(\theta_1)\\ \exp(\theta_2) \end{pmatrix} \]
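The transformation and its inverse can be coded as small helper functions (a sketch; the names theta_to_lambda and lambda_to_theta are ours, not part of the coursework code) for mapping initial values and estimates between parametrisations:

```r
# Map between the unconstrained theta and constrained lambda parametrisations
theta_to_lambda <- function(theta)  c(atan(theta[1]), exp(theta[2]))
lambda_to_theta <- function(lambda) c(tan(lambda[1]), log(lambda[2]))

# round trip recovers c(0.5, 1.2)
theta_to_lambda(lambda_to_theta(c(0.5, 1.2)))
```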

We first code an expression for the log density in terms of \(\bm{\theta}\), written as the summand before the summation over \(j\) is carried out.

Code
expr_log_dens <- 
  expression(
    (
      atan(theta1)*y+exp(theta2)*log(cos(atan(theta1)))+
        (exp(theta2)-2)*log(2)+
          2*lgamma(exp(theta2)/2)-
            log(pi)-
              lgamma(exp(theta2)))/(N+1)-
                log(1+((y/(exp(theta2)+2*j))^2))
    )

Note that we divide the terms that do not depend on \(j\) by \(N+1\): when the expression is later summed over \(j=0,\ldots,N\), these terms appear \(N+1\) times, so the division recovers the correct log density.
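The division by \(N+1\) can be checked standalone: summing the per-\(j\) expression over \(j=0,\ldots,N\) should reproduce the log density computed directly (a sketch with small illustrative values of \(N\), \(y\) and \(\bm{\theta}\)):

```r
# Check that summing the per-j expression over j = 0..N reproduces the
# full log density, because each j-independent term was divided by N + 1.
# All numeric values below are illustrative.
N <- 50; y <- 1.3; theta1 <- 0.4; theta2 <- 0.7; j <- 0:N

per_j <- (atan(theta1)*y + exp(theta2)*log(cos(atan(theta1))) +
          (exp(theta2)-2)*log(2) + 2*lgamma(exp(theta2)/2) -
          log(pi) - lgamma(exp(theta2)))/(N+1) -
         log(1 + (y/(exp(theta2)+2*j))^2)

direct <- atan(theta1)*y + exp(theta2)*log(cos(atan(theta1))) +
          (exp(theta2)-2)*log(2) + 2*lgamma(exp(theta2)/2) -
          log(pi) - lgamma(exp(theta2)) -
          sum(log(1 + (y/(exp(theta2)+2*j))^2))

all.equal(sum(per_j), direct)  # TRUE
```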

We now use deriv to work out the negative loglikelihood (i.e. summing over \(j\) as well as over the sample) and its corresponding gradient with respect to \(\bm{\theta}\):

Code
deriv_log_dens <- deriv(expr   = expr_log_dens,
                  namevec      = c("theta1","theta2"),
                  function.arg = c("theta1","theta2","y","j","N"),
                  hessian      = F)

# negative loglikelihood

nll    <- function(theta = c(0,0),
                     y     = 1,
                     N     = 1000){
  
  n   <- length(y)
  nld <- rep(NA,n)
  
  for (i in 1:n){
    aux  <- deriv_log_dens(theta1 = theta[1],
                           theta2 = theta[2],
                               y  = y[i],
                               j  = 0:N,
                               N  = N)
  
    nld[i] <- sum(as.numeric(aux)) # sum over j
  
  }
 res <- -sum(nld) # sum over sample
  
 
  
res
 
}

# gradient of negative loglikelihood

grad_nll <- function(theta = c(1,1),
                     y     = 1,
                     N     = 1000){
  
  n    <- length(y)
  grad <- matrix(NA,n,2)
  
  for (i in 1:n){
    aux  <- deriv_log_dens(theta1 = theta[1],
                           theta2 = theta[2],
                                y  = y[i],
                                j  = 0:N,
                                N  = N)
  
    grad[i,1:2] <- apply(attr(aux,"gradient"),2,sum)
  
  }
  
 -colSums(grad)
 
}

We now create a function to perform minimisation through the list of all initial values:

Code
#| code-fold: show

fit_optim <- function(L0,
                      fn,
                      gr,
                      y,
                      N = 1000){

fit <- vector("list",length = dim(L0)[1])

for (i in 1:dim(L0)[1]){

  
fit[[i]]<- optim(par      = L0[i,],
                  fn      = fn,
                  gr      = gr ,
                  method  = "BFGS",
                  hessian = T,
                  y       = y,
                  N       = N)
}

extract_negloglik <- 
  function(optim_object){
      optim_object$value
  }

# selects the best optimisation with minimum negative loglikelihood
nll_vals <-
  sapply(X   = fit,
        FUN  = extract_negloglik)

fit[[which.min(nll_vals)]]  # returns the final selected optimisation
}

The best optimisation (in terms of \(\bm{\theta}\) and out of the 100 initial values) is given by:

Code
# reparametrise the initial values into the theta coordinates

L0_reparam <- cbind(tan(L0[,1]),log(L0[,2])) 



fit_q2 <- fit_optim(L0 = L0_reparam,
                    fn = nll ,
                    gr = grad_nll,
                    y  = y_sample_q1,
                    N  = N_fix)

fit_q2
$par
[1] 0.8953146 1.7725376

$value
[1] 636.2355

$counts
function gradient 
      46       14 

$convergence
[1] 0

$message
NULL

$hessian
         [,1]     [,2]
[1,] 816.7478 731.2459
[2,] 731.2459 677.9727

We check that the gradient is effectively the zero vector:

Code
grad_nll(theta = fit_q2$par,
         y     = y_sample_q1,
         N     = N_fix)
[1] -1.338970e-05 -1.332534e-05
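It can also be worth validating an analytic gradient against a central finite-difference approximation at an arbitrary point. The sketch below is standalone and uses a simple quadratic as a stand-in objective, since the coursework negative loglikelihood is not reproduced here:

```r
# Central finite-difference gradient (sketch; f is a stand-in objective)
fd_grad <- function(f, theta, h = 1e-6) {
  sapply(seq_along(theta), function(k) {
    e <- replace(numeric(length(theta)), k, h)  # h times k-th unit vector
    (f(theta + e) - f(theta - e)) / (2 * h)
  })
}

f <- function(theta) (theta[1] - 1)^2 + 2 * (theta[2] + 0.5)^2
fd_grad(f, c(0, 0))  # analytic gradient at (0,0) is (-2, 2)
```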

We also check that the Hessian at the final iteration is positive definite, by computing its eigenvalues.

Code
eigen(fit_q2$hessian)$values
[1] 1481.8908   12.8297

Both eigenvalues are positive, so the Hessian is positive definite and we confirm that the stationary point is a minimum.

We now replot the contours to verify visually that the minimum lies in the region indicated above.

Code
contour(x      = lambda1_grid,
        y      = theta2_grid,
        z      = nll_grid,
        levels = levels,
        xlab   = expression(lambda[1]),
        ylab   = expression(lambda[2]))

contour(x      = lambda1_grid,
        y      = theta2_grid,
        z      = nll_grid,
        levels =levels[1],
        col="red",
        add=T)


points(x   = atan(fit_q2$par[1]),
       y   = exp(fit_q2$par[2]),
       col = "red",
       pch = 16)

Question 3

Check the sensitivity of the MLE to the choice of \(N\) by plotting (separately) the values of \(\hat{\lambda}_1\) and \(\hat{\lambda}_2\) as function of \(\log_{10}(N)\). You should use the values \(10^1,10^2, 10^3,10^4,10^5,10^6\) for \(N\). What conclusions can you make from these two plots?

Solution Question 3

Code
NN <- 10^(1:6)  # N = 10, 100, ..., 10^6 as specified in the question

fit_q3 <- vector("list",length =length(NN))
lam1 <- rep(NA,length(NN))
lam2 <- rep(NA,length(NN))
for (i in 1:length(NN)){
  
  
  fit_q3[[i]]<-fit_optim(L0=L0_reparam,
                     fn = nll ,
                     gr = grad_nll,
                     y = y_sample_q1,
                     N = NN[i])
  
  lam1[i] <- atan(fit_q3[[i]]$par[1])
  lam2[i] <- exp(fit_q3[[i]]$par[2])
}

par(mfrow=c(1,2))

plot(log(NN,10),lam1,
     type = "l",
     xlab = "log10(N)",
     ylab=expression(hat(lambda)[1]))

plot(log(NN,10),lam2,
     type = "l",
     xlab = "log10(N)",
     ylab=expression(hat(lambda)[2]))

Both curves are essentially flat, so the sensitivity of the MLEs to the chosen value of \(N\) is low.

Question 4

Compute the maximum likelihood estimate of the mean parameter \[\mu(\bm{\lambda}_*)=E[Y|\bm{\lambda}_*]=\int_\rel y\,f(y|\bm{\lambda}_*)\,dy\] Also compute an asymptotic 95% confidence interval for \(\mu(\bm{\lambda}_*)\). State clearly any assumptions you have made.

Solution to Question 4

Let \[\Psi(\bm{\lambda})=-\lambda_2\,\log(\cos(\lambda_1))\] From the definition of the density we have that

\[ \Psi(\bm{\lambda})=\log \left(\int_\rel \exp(\lambda_1\,y) g(y|\lambda_2)\,dy \right) \]

Assuming that we can interchange differentiation and integration, we have

\[ \begin{split} \frac{d}{d\lambda_1}\Psi(\bm{\lambda}) &= \frac{\displaystyle \frac{d}{d\lambda_1}\int_\rel \exp(\lambda_1\,y) g(y|\lambda_2)\,dy} {\exp(\Psi(\bm{\lambda}))}\\ &= \frac{\displaystyle \int_\rel \frac{d}{d\lambda_1}\exp(\lambda_1\,y) g(y|\lambda_2)\,dy} {\exp(\Psi(\bm{\lambda}))}\\ &= \frac{\displaystyle \int_\rel y\exp(\lambda_1\,y) g(y|\lambda_2)\,dy} {\exp(\Psi(\bm{\lambda}))}\\ &= \int_\rel y f(y|\bm{\lambda})\,dy\\ \rule{0in}{3ex}&= E[Y|\bm{\lambda}] \end{split} \]

Therefore we have

\[\mu(\bm{\lambda})=\frac{d}{d\lambda_1}\Psi(\bm{\lambda})=\lambda_2\,\tan(\lambda_1)\]

You can arrive at the same expression by differentiating both sides of the equation defining \(\Psi(\bm{\lambda})\) with respect to \(\lambda_1\), but the same interchange assumption is needed.
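The identity \(\mu(\bm{\lambda})=\lambda_2\tan(\lambda_1)\) can also be cross-checked numerically by integrating \(y\,f(y|\bm{\lambda})\) with the truncated density (a standalone sketch; the parameter values and the truncation \(N=2000\) are illustrative):

```r
# Numerical check that E[Y|lambda] = lambda2 * tan(lambda1).
# N = 2000 and the parameter values are illustrative choices.
f_ghs <- function(y, l1, l2, N = 2000) {
  sapply(y, function(yi)
    exp(l1 * yi + l2 * log(cos(l1)) +
        (l2 - 2) * log(2) + 2 * lgamma(l2 / 2) - log(pi) - lgamma(l2) -
        sum(log(1 + (yi / (l2 + 2 * (0:N)))^2))))
}

l1 <- 0.4; l2 <- 2
mu_num <- integrate(function(y) y * f_ghs(y, l1, l2), -40, 40)$value
c(mu_num, l2 * tan(l1))  # the two values should be close
```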

Using the first equation of the stationary system \[\nabla_{\!\bm{\lambda}} \ell(\bm{\lambda}|\bm{y}) = \bm{0}_2\]

we obtain

\[\widehat{\mu(\bm{\lambda}^*)}=\mu(\widehat{\bm{\lambda}})=\hat{\lambda}_2\,\tan(\hat{\lambda}_1)=\frac{\sum_{i=1}^n y_i}{n}=\bar{y}=\mbox{5.2696161}\]

Let \(g(\bm{\theta})=\exp(\theta_2)\theta_1\). Then the Jacobian is given by

\[ \bm{J}_g (\bm{\theta})= \left[ \nabla_{\!\bm{\theta}} \, g(\bm{\theta}) \right]^T= \left( \exp(\theta_2),\theta_1 \exp(\theta_2) \right) \]

Then we use the delta method, specifically the alternative version in Proposition 3.4 that uses the observed Fisher information matrix, to obtain the confidence interval:

Code
mu_hat <-mean(y_sample_q1)

theta1_hat <- fit_q2$par[1]
theta2_hat <- fit_q2$par[2]

J<-c(exp(theta2_hat),theta1_hat*exp(theta2_hat))

I_obs<-fit_q2$hessian

V <-as.numeric(J%*% solve(I_obs)%*%(J))

# conf int

c(mu_hat-1.96*sqrt(V),mu_hat+1.96*sqrt(V))
[1] 4.865956 5.673276

Question 5

Compute an asymptotic 95% confidence interval for the unknown parameter \(\lambda^*_2\)

  • Using the asymptotic normal approximation of \(\hat{\lambda}_2\)

  • Using the asymptotic normal approximation of \(\log( \hat{\lambda}_2)\)

Solution to Question 5

  • Using the asymptotic normal approximation of \(\hat{\lambda}_2=\exp(\hat{\theta}_2)\) we can use the delta method with \(g(\bm{\theta})=\exp(\theta_2)\) and the Jacobian is now

\[ \bm{J}_g(\bm{\theta})=(0,\exp(\theta_2)) \]

The observed information in the \(\bm{\theta}\) parametrisation is the Hessian returned by the optimisation, therefore we have:

Code
lambda2_hat <- exp(theta2_hat)

J<-c(0,exp(theta2_hat))



V <-as.numeric(J%*% solve(fit_q2$hessian)%*%(J))

# conf int

c(lambda2_hat-1.96*sqrt(V),lambda2_hat+1.96*sqrt(V))
[1] 3.494729 8.276811
  • Using the asymptotic normal approximation of \(\log( \hat{\lambda}_2)\)

We can obtain a confidence interval for \(\lambda_2^*\) by exponentiating an asymptotic confidence interval for \(\theta_2^*=\log(\lambda^*_2)\) based on the asymptotic normal approximation of the sampling distribution of \(\hat{\theta}_2\).

Code
V <-solve(fit_q2$hessian)[2,2]

# conf int

exp(c(theta2_hat-1.96*sqrt(V),theta2_hat+1.96*sqrt(V)))
[1] 3.920804 8.835508

The two intervals differ somewhat, but we prefer the asymmetric one constructed via \(\log(\lambda_2)\): it is always contained in the positive half-line, whereas the other interval may include negative values when \(\lambda^*_2\) is close enough to zero.

Question 6

Use the generalised likelihood ratio to test the hypotheses:

\[H_0:\,\mu(\bm{\lambda}_*)=5\qquad \mbox{vs}\qquad H_a:\,\mu(\bm{\lambda}_*)\neq 5\]

using a significance level \(\alpha=0.05\).

Separately, also test

\[H_0:\,\lambda^*_2=5\qquad \mbox{vs}\qquad H_a:\,\lambda^*_2\neq 5\]

using a significance level \(\alpha=0.05\).

Solution to Question 6

Under the null we have that

\[ \mu(\bm{\theta}_*)=\exp(\theta^*_2)\theta^*_1=\mu \]

so we can write \(\theta^*_1= \mu \exp(-\theta_2^*)\), and therefore the negative loglikelihood can be written in terms of \(\theta_2\) alone in order to find the minimum under the null.
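The substitution is easy to verify standalone: with \(\theta_1=\mu\exp(-\theta_2)\), the implied mean \(\exp(\theta_2)\,\theta_1\) equals \(\mu\) for any value of \(\theta_2\):

```r
# With theta1 = mu * exp(-theta2), the implied mean exp(theta2) * theta1
# equals mu regardless of theta2. Numeric values are illustrative.
mu <- 5; theta2 <- 1.3
theta1 <- mu * exp(-theta2)
exp(theta2) * theta1  # returns 5
```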

We first code the corresponding negative loglikelihood of \(\theta_2\) (and gradient) for a general mean value \(\mu\).

Code
library(rlang) 


theta1   <- expr(mu*exp(-theta2))

# note we use the same expression for the log density above just plugging-in the null hyp restriction

expr_log_dens_theta2<- 
  expr(
    (
      atan(!!theta1)*y+exp(theta2)*log(cos(atan(!!theta1)))+
        (exp(theta2)-2)*log(2)+
          2*lgamma(exp(theta2)/2)-
            log(pi)-
              lgamma(exp(theta2))
    )/(N+1)-
      log(1+((y/(exp(theta2)+2*j))^2))
    )


deriv_log_dens_theta2 <- deriv(expr   = expr_log_dens_theta2,
                        namevec      = c("theta2"),
                        function.arg = c("theta2","mu","y","j","N"),
                        hessian      = F)

# negative loglikelihood

nll_theta2    <- function(theta  = 0,
                           y     = 1,
                           N     = 1000,
                           mu    = 5){
  
  n   <- length(y)
  nld <- rep(NA,n)
  
  for (i in 1:n){
    aux  <- deriv_log_dens_theta2(theta2 = theta,
                                      y  = y[i],
                                      j  = 0:N,
                                      N  = N,
                                      mu = mu)
  
    nld[i] <- sum(as.numeric(aux)) # sum over j
  
  }
  
 res <- -sum(nld) # sum over sample

  
res
 
}


# gradient of negative loglikelihood

grad_nll_theta2 <- function(theta = 0,
                            y     = 1,
                            N     = 1000,
                            mu    = 5){
  
  n    <- length(y)
  grad <- matrix(NA,n,1)
  
  for (i in 1:n){
    aux  <- deriv_log_dens_theta2(theta2 = theta,
                                     y  = y[i],
                                     j  = 0:N,
                                     N  = N,
                                     mu = mu)
  
    grad[i,1] <- apply(attr(aux,"gradient"),2,sum)
  
  }
  
 -colSums(grad)
 
}

grad_nll_theta2(0,y_sample_q1,N_fix,5)
[1] -80.01413

We now optimise

Code
fit_optim_1d <- function(L0,
                      fn,
                      gr,
                      y,
                      N = 1000,
                      mu = 5){

fit <- vector("list",length = length(L0))

for (i in 1:length(L0)){

  
fit[[i]]<- optim(par      = L0[i],
                  fn      = fn,
                  gr      = gr ,
                  method  = "BFGS",
                  hessian = T,
                  y       = y,
                  N       = N,
                  mu      = mu)


# discard a fit whose Hessian is not positive (i.e. not a local minimum)
if (fit[[i]]$hessian <= 0){
  fit[[i]]$value <- Inf
}

}

extract_negloglik <- 
  function(optim_object){
      optim_object$value
  }

# selects the best optimisation with minimum negative loglikelihood
nll_vals <-
  sapply(X   = fit,
        FUN  = extract_negloglik)



fit[[which.min(nll_vals)]]  # returns the final selected optimisation
}

# we restrict the range of the initial values due to some convergence issues 
# that appear when theta2 is large

fit_q6 <- fit_optim_1d(L0 = seq(1,10,length=100),
                    fn = nll_theta2,
                    gr = grad_nll_theta2,
                    y  = y_sample_q1,
                    N  = N_fix,
                    mu = 5)


fit_q6
$par
[1] 1.778203

$value
[1] 637.1185

$counts
function gradient 
      20        6 

$convergence
[1] 0

$message
NULL

$hessian
         [,1]
[1,] 22.61517

Alternatively, you could use the optimise function in R, which is purpose-built for one-dimensional minimisation but does not use the gradient (see below):

Code
nll_theta20 <- function(theta2,
                       y,
                       N = 1000,
                       mu){
  
  par_old <- c(mu*exp(-theta2),theta2)
  
  nll(par_old,y=y,N=N)
}


fit_q6_alt <- optimise(f       = nll_theta20,
                      interval = c(0,10),
                      y        = y_sample_q1,
                      N        = N_fix,
                      mu       = 5)
fit_q6_alt
$minimum
[1] 1.778194

$objective
[1] 637.1185

Type ?optimise for details. We now perform the GLRT.

Code
glrt <- 2*(-fit_q2$value + fit_q6$value)

glrt
[1] 1.765916
Code
crit_val <-qchisq(0.95,1)

# reject?

glrt>crit_val
[1] FALSE

We do not reject the null hypothesis. This is expected from the plots below.
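Equivalently, the decision can be reported as a p-value (a standalone sketch that hard-codes the GLRT statistic value from the output above):

```r
# p-value form of the GLRT decision; 1.765916 is the statistic from above,
# compared against a chi-squared distribution with 1 degree of freedom
glrt    <- 1.765916
p_value <- pchisq(glrt, df = 1, lower.tail = FALSE)
p_value > 0.05  # TRUE: do not reject H0 at the 5% level
```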

Code
t2 <- seq(1.5,2,length=1000)
nll_mu <-rep(NA, 1000)



for (i in 1:1000){
nll_mu[i]<- nll_theta2(t2[i],y_sample_q1,mu=5)
}



plot(t2,nll_mu,
     type="l",
     ylab = "negative loglikelihood",
     xlab   = expression(theta[2]))

abline(v=theta2_hat,col="grey")
abline(v=fit_q6$par,col="red")

legend("topright",
       legend = c("minimum under H0","global minimum"),
       col= c("red","grey"),
       lty =1)

Above we compare the MLE of \(\theta_2\) under the restriction \(\mu=5\) (red) with the unrestricted MLE (grey). The difference is very small, both in terms of the actual estimates and in terms of the negative loglikelihood values.

Code
contour(x      = lambda1_grid,
        y      = theta2_grid,
        z      = nll_grid,
        levels = levels,
        xlab   = expression(lambda[1]),
        ylab   = expression(lambda[2]))


 t2<-seq(-2,20,length=100)
 x=atan(5/exp(t2))
 y<-exp(t2)
 lines(x,y,col="green")

The green line above represents all the parameter values, in the \(\bm{\lambda}\) parametrisation, that satisfy the null hypothesis \(\mu=5\). The contours with small values cross this green line, which gives strong support to the null hypothesis.

Now, the test

\[H_0:\,\lambda^*_2=5\qquad \mbox{vs}\qquad H_a:\,\lambda^*_2\neq 5\]

is equivalent to

\[H_0:\,\theta^*_2=\log(5)\qquad \mbox{vs}\qquad H_a:\,\theta^*_2\neq \log(5)\]

We first code the corresponding negative loglikelihood of \(\theta_1\) (and its gradient) under the null restriction \(\lambda_2=5\).

Code
# note we use the same expression for the log density above just plugging-in the null hyp restriction

expr_log_dens_theta1 <- 
  expression(
    (
      atan(theta1)*y+5*log(cos(atan(theta1)))+
        (5-2)*log(2)+
          2*lgamma(5/2)-
            log(pi)-
              lgamma(5))/(N+1)-
                log(1+((y/(5+2*j))^2))
    )


deriv_log_dens_theta1 <- deriv(expr   = expr_log_dens_theta1,
                        namevec      = c("theta1"),
                        function.arg = c("theta1","y","j","N"),
                        hessian      = F)

# negative loglikelihood

nll_theta1    <- function(theta  = 0,
                           y     = 1,
                           N     = 1000){
  
  n   <- length(y)
  nld <- rep(NA,n)
  
  for (i in 1:n){
    aux  <- deriv_log_dens_theta1(theta1 = theta,
                                      y  = y[i],
                                      j  = 0:N,
                                      N  = N)
  
    nld[i] <- sum(as.numeric(aux)) # sum over j
  
  }
  
 res <- -sum(nld) # sum over sample
  
 
  
res
 
}


# gradient of negative loglikelihood

grad_nll_theta1 <- function(theta = 0,
                            y     = 1,
                            N     = 1000,
                            mu    = 5){
  
  n    <- length(y)
  grad <- matrix(NA,n,1)
  
  for (i in 1:n){
    aux  <- deriv_log_dens_theta1(theta1 = theta,
                                     y  = y[i],
                                     j  = 0:N,
                                     N  = N)
  
    grad[i,1] <- apply(attr(aux,"gradient"),2,sum)
  
  }
  
 -colSums(grad)
 
}

grad_nll_theta1(0,y_sample_q1,N_fix)
[1] -1317.404

We now optimise

Code
fit_q6_2 <- optim(par    = 0,
                fn     = nll_theta1,
                gr     = grad_nll_theta1,
                method = "BFGS",
                y      = y_sample_q1,
                N      = N_fix)

fit_q6_2
$par
[1] 1.053928

$value
[1] 636.5273

$counts
function gradient 
      21        7 

$convergence
[1] 0

$message
NULL

Alternatively, you could use the optimise function in R, which is purpose-built for one-dimensional minimisation but does not use the gradient (see below):

Code
nll_theta10 <- function(theta1,
                       y,
                       N = 1000){
  
  par_old <- c(theta1,log(5))
  
  nll(par_old,y=y,N=N)
}


fit_q6_2_alt <- optimise(f       = nll_theta10,
                      interval = c(0,10),
                      y        = y_sample_q1,
                      N        = N_fix)
fit_q6_2_alt
$minimum
[1] 1.053942

$objective
[1] 636.5273

We now perform the GLRT.

Code
glrt <- 2*(-fit_q2$value + fit_q6_2$value)

glrt
[1] 0.5835991
Code
crit_val <-qchisq(0.95,1)

# reject?

glrt>crit_val
[1] FALSE

We do not reject the null hypothesis at significance level 5%.

Question 7

Consider the following data frame

Code
data_q7 <-read.table("http://people.bath.ac.uk/kai21/ASI/CW_2023/data_q7.txt")

that contains a bivariate sample \[(x_1,y_1),\,(x_2,y_2),\,\ldots,\,(x_n,y_n)\] of size \(n=300\).

Use the parametric family \(\mathcal F_1\) defined in Question 1 to find an appropriate model for the unknown conditional distribution of \(\mathcal Y\) given \(\mathcal X=x\), that is \(f_*(y|x)\). The model should be defined by specifying the mean function \(\mu(\bm{\theta}^{(1)},x)\) as follows:

\[ \mu(\bm{\theta}^{(1)},x) =g^{-1}(\theta_1+\theta_2\,x +\theta_3\,x^2+\theta_4\,x^3 +\cdots+\theta_{p+1}\,x^p) \]

for some choice of link function \(g\) and some choice of integer \(p\geq 1\).

From a set of candidate models (that is for different choices of \(g\) and \(p\)), choose the model with the smallest AIC (Akaike Information Criterion). Only present the results from the maximum likelihood estimation from the best chosen model and simply comment on the other models considered.

Now, repeat the same process above to find an appropriate model for the unknown conditional distribution of \(\mathcal Y\) given \(\mathcal X=x\) but now based on the Gamma parametric family:

\[ \mathcal F_{gamma}=\left\{f(y|\lambda_1,\lambda_2)=\frac{\lambda_2^{\lambda_1}}{\Gamma(\lambda_1)}y^{\lambda_1-1}\exp(-\lambda_2\,y)\,:\, \lambda_1>0\,,\lambda_2>0,y>0\right\} \]

Finally, find an appropriate model for the unknown conditional distribution of \(\mathcal Y\) given \(\mathcal X=x\) but now based on the Normal parametric family:

\[ \mathcal F_{normal}=\left\{f(y|\lambda_1,\lambda_2)=\frac{1}{\lambda_2\sqrt{2\pi}}\,\exp\left(-\frac{(y-\lambda_1)^2}{2\lambda_2^2}\right)\,:\, \lambda_1\in \rel,\,\lambda_2>0,y\in \rel\right\} \]

For each of the three chosen models, you should plot the data together with the maximum likelihood estimate of the mean function as well as corresponding asymptotic 95% confidence bands in the range \(x\in(-3,3)\). Comment on the differences between the confidence bands and the mean function estimates. You must select the best model out of the three, based on the Akaike Information Criterion.

Solution to Question 7

We start with some exploratory visual analysis to give us an idea of which regression models might fit well. We first plot the data together with a nonparametric estimate of the mean function with corresponding 95% confidence bands.

Code
library(tidyverse)

data_q7 %>% 
  ggplot(aes(x,y))+
    geom_point() +
      geom_smooth(aes(col = "nonparametric")) +
        labs(title = "Nonparametric fit for mean function with 95% confidence band",
             col = "Mean function") +
          coord_cartesian(ylim = c(10,30))

A nonparametric estimate of the mean function does not assume a specific parametric model (like the polynomial model for \(\mu(\bm{\theta}^{(1)},x)\) above) and uses the identity link function. The specific nonparametric estimate used by geom_smooth is based on a local regression method called LOESS (click on the link for details). You need to load the tidyverse library to use geom_smooth; type ?geom_smooth for details on how to use this function. We can see immediately that the identity link function with a second order polynomial (an inverted parabola) will fit the data reasonably well. We will start from here.

When we add the option method = "lm" inside geom_smooth, the fit becomes parametric and under the assumption of a normal distribution. The parametric model is fully specified after the specification of the mean equation via the option formula = y ~ x + I(x^2) (in the quadratic case, for example). Below, we plot fits, under normality and using identity link, when the mean function above is linear, quadratic and cubic.

Code
data_q7 %>% 
  ggplot(aes(x,
             y))+
    geom_point() +
      geom_smooth(aes(col = "nonparametric")) +
          geom_smooth(aes(col = "linear fit"),
                      method = "lm",
                      se = F,
                      formula = y ~ x) +
            geom_smooth(aes(col = "quadratic fit"),
                        method = "lm",
                      se = F,
                      formula = y ~ x + I(x^2)) +
              geom_smooth(aes(col = "cubic fit"),
                        method = "lm",
                      se = F,
                      formula = y ~ x + I(x^2) + I(x^3)) +
              geom_smooth(aes(col = "quartic fit"),
                        method = "lm",
                      se = F,
                      formula = y ~ x + I(x^2) + I(x^3) + I(x^4))+
                 labs(col = "Mean function",
                      title =  "Parametric fits of mean function under normality",
                      subtitle = "Identity link") +
                  coord_cartesian(ylim = c(10,25))

Under the normality assumption we can see clearly that the linear fit of the mean function is poor. The quadratic, cubic and fourth order fits are almost identical to the nonparametric estimate of the mean function in the regions where the data lie, but they differ substantially where there is no data. The shaded confidence band gives an idea of which models are feasible, in the sense that any mean curve inside the band is a statistically plausible model for the mean function. The absence of data between the three main data clusters makes the confidence bands wider there. To choose between the quadratic and cubic models we therefore need another criterion, and we use the AIC. We must formally fit these models to obtain the AIC, which we do below. Until further notice, we use the identity link.

Code
fit_lm_linear <- lm(y~x,data_q7)
fit_lm_quad   <- lm(y~x+I(x^2),data_q7)
fit_lm_cubic  <- lm(y~x+I(x^2)+I(x^3),data_q7)
fit_lm_quart  <- lm(y~x+I(x^2)+I(x^3)+I(x^4),data_q7)
AIC(fit_lm_linear)
[1] 2384.924
Code
AIC(fit_lm_quad  )
[1] 2373.666
Code
AIC(fit_lm_cubic )
[1] 2375.356
Code
AIC(fit_lm_quart)
[1] 2377.22

The AIC selects the quadratic model as the best one according to this criterion, although the values are very close to each other.
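Recall that \(\pAIC=2k-2\ell(\widehat{\bm{\theta}})\), where \(k\) is the number of estimated parameters. As a standalone sketch on simulated toy data (not data_q7), we can confirm that R's AIC matches this definition via logLik:

```r
# AIC = 2k - 2*loglik; check against R's AIC() on a toy linear model
set.seed(1)
d   <- data.frame(x = rnorm(50))
d$y <- 1 + 2 * d$x + rnorm(50)

fit <- lm(y ~ x, data = d)
k   <- attr(logLik(fit), "df")  # number of parameters, including sigma

aic_manual <- 2 * k - 2 * as.numeric(logLik(fit))
all.equal(aic_manual, AIC(fit))  # TRUE
```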

We now proceed with the GHS model (parametric family \(\mathcal F_1\)) and fit quadratic, cubic and fourth order mean functions: \[\mu(\bm{\theta},x)=\theta_1+\theta_2\,x+\cdots+\theta_p\,x^{p-1}\qquad p =3,4,5\] We reparametrise \(\theta_{p+1} = \log(\lambda_2)\). The mean is unbounded for this family, so no other reparametrisation is needed for the optimisation.

The corresponding negative loglikelihood (and gradient) in terms of the parameter \(\bm{\theta}=(\theta_1,\theta_2,\ldots,\theta_{p+1})^T\) are coded below:

Code
# we use the same expression as before but we now plug-in expressions for the parametric mean functions
theta1_quad  <- expr((theta1+theta2*x+theta3*(x^2))*exp(-theta4))
theta1_cubic <- expr((theta1+theta2*x+theta3*(x^2)+theta4*(x^3))*exp(-theta5))
theta1_quart <- expr((theta1+theta2*x+theta3*(x^2)+theta4*(x^3)+theta5*(x^4))*exp(-theta6))
               
expr_log_dens_reg_quad <- 
  expr(
    (
      atan(!!theta1_quad)*y+exp(theta4)*log(cos(atan(!!theta1_quad)))+
        (exp(theta4)-2)*log(2)+
          2*lgamma(exp(theta4)/2)-
            log(pi)-
              lgamma(exp(theta4)))/(N+1)-
                log(1+((y/(exp(theta4)+2*j))^2))
    )

expr_log_dens_reg_cubic <- 
  expr(
    (
      atan(!!theta1_cubic)*y+exp(theta5)*log(cos(atan(!!theta1_cubic)))+
        (exp(theta5)-2)*log(2)+
          2*lgamma(exp(theta5)/2)-
            log(pi)-
              lgamma(exp(theta5)))/(N+1)-
                log(1+((y/(exp(theta5)+2*j))^2))
    )

expr_log_dens_reg_quart <- 
  expr(
    (
      atan(!!theta1_quart)*y+exp(theta6)*log(cos(atan(!!theta1_quart)))+
        (exp(theta6)-2)*log(2)+
          2*lgamma(exp(theta6)/2)-
            log(pi)-
              lgamma(exp(theta6)))/(N+1)-
                log(1+((y/(exp(theta6)+2*j))^2))
    )

deriv_log_dens_reg_quad <- deriv(expr   = expr_log_dens_reg_quad,
                      namevec      = c("theta1","theta2","theta3","theta4"),
                      function.arg = c("theta1","theta2","theta3","theta4","x","y","j","N"),
                      hessian      = F)

deriv_log_dens_reg_cubic <- deriv(expr   = expr_log_dens_reg_cubic,
                      namevec      = c("theta1","theta2","theta3","theta4","theta5"),
                      function.arg = c("theta1","theta2","theta3","theta4","theta5","x","y","j","N"),
                      hessian      = F)

deriv_log_dens_reg_quart <- deriv(expr   = expr_log_dens_reg_quart,
                      namevec      = c("theta1","theta2","theta3","theta4","theta5","theta6"),
                      function.arg = c("theta1","theta2","theta3","theta4","theta5","theta6","x","y","j","N"),
                      hessian      = F)

nll_reg_quad    <- function(theta = c(1,1,1,1),
                          x   = 1, 
                          y   = 1,
                          N   = 1000){
  

   
  n   <- length(y)
  nld <- rep(NA,n)
  
  for (i in 1:n){
    aux  <- deriv_log_dens_reg_quad(theta1  = theta[1],
                               theta2  = theta[2],
                               theta3  = theta[3],
                               theta4  = theta[4],
                                    y  = y[i],
                                    x  = x[i],
                                    j  = 0:N,
                                    N  = N)
  
    nld[i] <- sum(as.numeric(aux))
  
  }
  
 -sum(nld)
  
}




grad_nll_reg_quad<- function(theta = c(1,1,1,1),
                            x = 1, 
                            y = 1,
                            N = 1000){
  
  n    <- length(y)
  grad <- matrix(NA,n,4)
  
  for (i in 1:n){
    aux  <- deriv_log_dens_reg_quad(theta1  = theta[1],
                               theta2  = theta[2],
                               theta3  = theta[3],
                               theta4  = theta[4],
                                    y  = y[i],
                                    x  = x[i],
                                    j  = 0:N,
                                    N  = N)
  
    grad[i,1:4] <- apply(attr(aux,"gradient"),2,sum)
  
  }
 -colSums(grad)
 
}


nll_reg_cubic    <- function(theta = c(1,1,1,1,1),
                          x   = 1, 
                          y   = 1,
                          N   = 1000){
  

   
  n   <- length(y)
  nld <- rep(NA,n)
  
  for (i in 1:n){
    aux  <- deriv_log_dens_reg_cubic(theta1  = theta[1],
                               theta2  = theta[2],
                               theta3  = theta[3],
                               theta4  = theta[4],
                               theta5  = theta[5],
                                    y  = y[i],
                                    x  = x[i],
                                    j  = 0:N,
                                    N  = N)
  
    nld[i] <- sum(as.numeric(aux))
  
  }
  
 -sum(nld)
  
}




grad_nll_reg_cubic<- function(theta = c(1,1,1,1,1),
                            x = 1, 
                            y = 1,
                            N = 1000){
  
  n    <- length(y)
  grad <- matrix(NA,n,5)
  
  for (i in 1:n){
    aux  <- deriv_log_dens_reg_cubic(theta1  = theta[1],
                               theta2  = theta[2],
                               theta3  = theta[3],
                               theta4  = theta[4],
                               theta5  = theta[5],
                                    y  = y[i],
                                    x  = x[i],
                                    j  = 0:N,
                                    N  = N)
  
    grad[i,1:5] <- apply(attr(aux,"gradient"),2,sum)
  
  }
 -colSums(grad)
 
}

nll_reg_quart    <- function(theta = c(1,1,1,1,1,1),
                          x   = 1, 
                          y   = 1,
                          N   = 1000){
  

   
  n   <- length(y)
  nld <- rep(NA,n)
  
  for (i in 1:n){
    aux  <- deriv_log_dens_reg_quart(theta1  = theta[1],
                               theta2  = theta[2],
                               theta3  = theta[3],
                               theta4  = theta[4],
                               theta5  = theta[5],
                               theta6  = theta[6],
                                    y  = y[i],
                                    x  = x[i],
                                    j  = 0:N,
                                    N  = N)
  
    nld[i] <- sum(as.numeric(aux))
  
  }
  
 -sum(nld)
  
}




grad_nll_reg_quart<- function(theta = c(1,1,1,1,1,1),
                            x = 1, 
                            y = 1,
                            N = 1000){
  
  n    <- length(y)
  grad <- matrix(NA,n,6)
  
  for (i in 1:n){
    aux  <- deriv_log_dens_reg_quart(theta1  = theta[1],
                               theta2  = theta[2],
                               theta3  = theta[3],
                               theta4  = theta[4],
                               theta5  = theta[5],
                               theta6  = theta[6],
                                    y  = y[i],
                                    x  = x[i],
                                    j  = 0:N,
                                    N  = N)
  
    grad[i,1:6] <- apply(attr(aux,"gradient"),2,sum)
  
  }
 -colSums(grad)
 
}

We take the initial values from the coefficients of the fit of the polynomial model under normality above.

Code
coefficients(fit_lm_quad)
(Intercept)           x      I(x^2) 
 21.9508813   0.1268222  -1.3951485 
Code
coefficients(fit_lm_cubic)
(Intercept)           x      I(x^2)      I(x^3) 
 21.9478684  -2.5147235  -1.3803982   0.6568337 
Code
coefficients(fit_lm_quart)
(Intercept)           x      I(x^2)      I(x^3)      I(x^4) 
 21.8831786  -2.4994190  -0.4111086   0.6481662  -0.2350041 
Code
fit_q7_ghs_quad <-
        optim(c(22,0,-1,1),
              fn = nll_reg_quad,
              gr = grad_nll_reg_quad,
              x = data_q7$x,
              y = data_q7$y,
              N = N_fix,
              hessian = T)

fit_q7_ghs_quad$par
[1] 22.0250596  0.1456677 -1.4202490  0.7977638
Code
fit_q7_ghs_cubic <-
        optim(c(22,-3,-1.5,1,1),
              fn = nll_reg_cubic,
              gr = grad_nll_reg_cubic,
              x = data_q7$x,
              y = data_q7$y,
              N = N_fix,
              hessian = T)

fit_q7_ghs_cubic$par
[1] 22.0946373 -4.5687479 -1.4150127  1.1707690  0.7998508
Code
fit_q7_ghs_quart <-
        optim(c(22,-2,0,1,0,1),
              fn = nll_reg_quart,
              gr = grad_nll_reg_quart,
              x = data_q7$x,
              y = data_q7$y,
              N = N_fix,
              hessian = T)

fit_q7_ghs_quart$par
[1] 21.8120115 -3.1712708  0.6813449  0.8129344 -0.4993293  0.8006595

We note the similarity between the fitted coefficients of the mean functions and those of the corresponding normal fits.
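Before comparing the fits it is worth confirming that each optimisation converged (a minimal check; `optim` reports a convergence code of 0 on success):

```r
# Convergence codes for the three GHS fits; 0 indicates success.
c(quad  = fit_q7_ghs_quad$convergence,
  cubic = fit_q7_ghs_cubic$convergence,
  quart = fit_q7_ghs_quart$convergence)
```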

We now plot the three fits (quadratic, cubic and fourth order) under the GHS response distribution together with the corresponding fits under normality.

Code
n_grid   <- 100
xx       <- seq(-3,3,length=n_grid)
mean_ghs_quad  <- rep(NA,n_grid)
mean_ghs_cubic <- rep(NA,n_grid)
mean_ghs_quart <- rep(NA,n_grid)

ci_ghs_quad   <- matrix(NA,
                       nrow = n_grid,
                       ncol = 2)

ci_ghs_cubic  <- matrix(NA,
                       nrow = n_grid,
                       ncol = 2)

ci_ghs_quart  <- matrix(NA,
                       nrow = n_grid,
                       ncol = 2)


for (i in 1:n_grid){
  vec.x_quad         <- c(1,xx[i],xx[i]^2)  
  vec.x_cubic        <- c(1,xx[i],xx[i]^2,xx[i]^3)  
  vec.x_quart        <- c(1,xx[i],xx[i]^2,xx[i]^3,xx[i]^4)  
  
  est_ghs_quad  <- crossprod(vec.x_quad ,fit_q7_ghs_quad$par[1:3])
  est_ghs_cubic <- crossprod(vec.x_cubic,fit_q7_ghs_cubic$par[1:4])
  est_ghs_quart <- crossprod(vec.x_quart,fit_q7_ghs_quart$par[1:5])
  
  se_ghs_quad   <- sqrt(crossprod(vec.x_quad ,solve(fit_q7_ghs_quad$hessian)[1:3,1:3])%*%vec.x_quad)
  se_ghs_cubic  <- sqrt(crossprod(vec.x_cubic,solve(fit_q7_ghs_cubic$hessian)[1:4,1:4])%*%vec.x_cubic)
  se_ghs_quart  <- sqrt(crossprod(vec.x_quart,solve(fit_q7_ghs_quart$hessian)[1:5,1:5])%*%vec.x_quart)
  
  mean_ghs_quad[i]  <- est_ghs_quad
  mean_ghs_cubic[i] <- est_ghs_cubic
  mean_ghs_quart[i] <- est_ghs_quart
  
  ci_ghs_quad[i,]   <- c(est_ghs_quad -1.96*se_ghs_quad ,est_ghs_quad +1.96*se_ghs_quad)
  ci_ghs_cubic[i,]  <- c(est_ghs_cubic-1.96*se_ghs_cubic,est_ghs_cubic+1.96*se_ghs_cubic)
  ci_ghs_quart[i,]  <- c(est_ghs_quart -1.96*se_ghs_quart ,est_ghs_quart +1.96*se_ghs_quart)
}

GHS_fit_quad<- data.frame(x = xx,
                          mean = mean_ghs_quad,
                          lwr = ci_ghs_quad[,1],
                          upr = ci_ghs_quad[,2])

GHS_fit_cubic<- data.frame(x = xx,
                          mean = mean_ghs_cubic,
                          lwr = ci_ghs_cubic[,1],
                          upr = ci_ghs_cubic[,2])

GHS_fit_quart<- data.frame(x = xx,
                          mean = mean_ghs_quart,
                          lwr = ci_ghs_quart[,1],
                          upr = ci_ghs_quart[,2])


data_q7 %>% 
  ggplot(aes(x,
             y))+
    geom_point() +
      geom_smooth(aes(col = "z_nonparametric"),
                  se = F) +
            geom_smooth(aes(col = "quadratic fit (normal)"),
                        lty = 2,
                        lwd = 2,
                        method = "lm",
                      se = F,
                      formula = y ~ x + I(x^2)) +
  geom_smooth(aes(col = "cubic fit (normal)"),
              lty = 2,
                        lwd = 2,
                        method = "lm",
                      se = F,
                      formula = y ~ x + I(x^2) + I(x^3)) +
  geom_smooth(aes(col = "fourth fit (normal)"),
                        lty = 2,
                        lwd = 2,
                        method = "lm",
                      se = F,
                      formula = y ~ x + I(x^2) + I(x^3) + I(x^4)) +
      geom_line(mapping = aes(x    = x,
                                y = mean,
                              col = "quadratic fit (GHS)"),
                  data  = GHS_fit_quad) +
      geom_line(mapping = aes(x    = x,
                                y = mean,
                              col = "cubic fit (GHS)"),
                  data  = GHS_fit_cubic) +
      geom_line(mapping = aes(x    = x,
                                y = mean,
                              col = "fourth order fit (GHS)"),
                  data  = GHS_fit_quart)+
        coord_cartesian(ylim = c(14,25))+
                 labs(col = "Mean function") +
          theme(legend.position = "bottom")

Again, all the parametric fits agree closely where there are data and differ elsewhere, except for the two quadratic fits. In terms of AIC (see below), any of the fitted GHS parametric models is much better than the best normal (quadratic) model.
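The pointwise bands above come from the delta method: with \(\widehat\mu(x)=v(x)^T\widehat{\bm\theta}_{1:p}\) and \(v(x)=(1,x,\ldots,x^{p-1})^T\), the standard error is \(\sqrt{v(x)^T\widehat\Sigma\, v(x)}\). A minimal sketch for a single grid point (assuming `fit_q7_ghs_quad` from above):

```r
# Delta-method interval for the GHS quadratic mean at a single x.
x0    <- 1.5
v     <- c(1, x0, x0^2)
Sigma <- solve(fit_q7_ghs_quad$hessian)[1:3, 1:3]  # inverse-Hessian variance block
mu0   <- sum(v * fit_q7_ghs_quad$par[1:3])
se0   <- sqrt(drop(t(v) %*% Sigma %*% v))
mu0 + c(-1.96, 1.96) * se0                         # pointwise 95% interval
```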

Code
AIC_GHS_quad  <- 2*length(fit_q7_ghs_quad$par) +2*fit_q7_ghs_quad$value
AIC_GHS_cubic <- 2*length(fit_q7_ghs_cubic$par)+2*fit_q7_ghs_cubic$value 
AIC_GHS_quart <- 2*length(fit_q7_ghs_quart$par)+2*fit_q7_ghs_quart$value 
AIC_GHS_quad
[1] 2257.687
Code
AIC_GHS_cubic
[1] 2258.8
Code
AIC_GHS_quart
[1] 2260.424
Code
AIC(fit_lm_quad)
[1] 2373.666

We now proceed with the Gamma distribution.

Code
lp_quad  <- expr(theta1+theta2*x+theta3*(x^2))
lp_cubic <- expr(theta1+theta2*x+theta3*(x^2)+theta4*(x^3))
lp_quart <- expr(theta1+theta2*x+theta3*(x^2)+theta4*(x^3)+theta5*(x^4))

expr_log_dens_gamma_quad <- 
  expr(
     exp(theta4)*theta4-
       lgamma(exp(theta4))+
        (exp(theta4)-1)*log(y)-
          exp(theta4)*log(!!lp_quad)-
            exp(theta4)*y/(!!lp_quad)
  )

expr_log_dens_gamma_cubic <- 
  expr(
     exp(theta5)*theta5-
       lgamma(exp(theta5))+
        (exp(theta5)-1)*log(y)-
          exp(theta5)*log(!!lp_cubic)-
            exp(theta5)*y/(!!lp_cubic)
  )

expr_log_dens_gamma_quart <- 
  expr(
     exp(theta6)*theta6-
       lgamma(exp(theta6))+
        (exp(theta6)-1)*log(y)-
          exp(theta6)*log(!!lp_quart)-
            exp(theta6)*y/(!!lp_quart)
  )


deriv_log_dens_gamma_quad <- deriv(expr   = expr_log_dens_gamma_quad,
                  namevec      = c("theta1","theta2","theta3","theta4"),
                  function.arg = c("theta1","theta2","theta3","theta4","x","y"),
                  hessian      = F)

deriv_log_dens_gamma_cubic <- deriv(expr   = expr_log_dens_gamma_cubic,
                  namevec      = c("theta1","theta2","theta3","theta4","theta5"),
                  function.arg = c("theta1","theta2","theta3","theta4","theta5","x","y"),
                  hessian      = F)

deriv_log_dens_gamma_quart <- deriv(expr   = expr_log_dens_gamma_quart,
                  namevec      = c("theta1","theta2","theta3","theta4","theta5","theta6"),
                  function.arg = c("theta1","theta2","theta3","theta4","theta5","theta6","x","y"),
                  hessian      = F)

nll_reg_gamma_quad    <- function(theta = c(1,1,1,1),
                                  x = 1, 
                                  y = 1){
  

    aux  <- deriv_log_dens_gamma_quad(theta1  = theta[1],
                           theta2  = theta[2],
                           theta3  = theta[3],
                           theta4  = theta[4],
                                y  = y,
                                x = x)
 
  
     -sum(as.numeric(aux))

}




grad_nll_gamma_quad<- function(theta = c(1,1,1,1),
                               x = 1, 
                               y = 1){
  

    aux  <- deriv_log_dens_gamma_quad(theta1  = theta[1],
                           theta2  = theta[2],
                           theta3  = theta[3],
                           theta4  = theta[4],
                                y  = y,
                                x = x)
  
  -apply(attr(aux,"gradient"),2,sum)
  
  
 
}


nll_reg_gamma_cubic    <- function(theta = c(1,1,1,1,1),
                                  x = 1, 
                                  y = 1){
  

    aux  <- deriv_log_dens_gamma_cubic(theta1  = theta[1],
                           theta2  = theta[2],
                           theta3  = theta[3],
                           theta4  = theta[4],
                           theta5  = theta[5],
                                y  = y,
                                x = x)
 
  
     -sum(as.numeric(aux))

}




grad_nll_gamma_cubic<- function(theta = c(1,1,1,1,1),
                               x = 1, 
                               y = 1){
  

    aux  <- deriv_log_dens_gamma_cubic(theta1  = theta[1],
                           theta2  = theta[2],
                           theta3  = theta[3],
                           theta4  = theta[4],
                           theta5  = theta[5],
                                y  = y,
                                x = x)
  
  -apply(attr(aux,"gradient"),2,sum)
  
  
 
}

nll_reg_gamma_quart    <- function(theta = c(1,1,1,1,1,1),
                                  x = 1, 
                                  y = 1){
  

    aux  <- deriv_log_dens_gamma_quart(theta1  = theta[1],
                           theta2  = theta[2],
                           theta3  = theta[3],
                           theta4  = theta[4],
                           theta5  = theta[5],
                           theta6  = theta[6],
                                y  = y,
                                x = x)
 
  
     -sum(as.numeric(aux))

}




grad_nll_gamma_quart<- function(theta = c(1,1,1,1,1,1),
                               x = 1, 
                               y = 1){
  

    aux  <- deriv_log_dens_gamma_quart(theta1  = theta[1],
                           theta2  = theta[2],
                           theta3  = theta[3],
                           theta4  = theta[4],
                           theta5  = theta[5],
                           theta6  = theta[6],
                                y  = y,
                                x = x)
  
  -apply(attr(aux,"gradient"),2,sum)
  
  
 
}
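As a quick check of the hand-coded Gamma log-density (quadratic case), it should agree with `dgamma()` when the shape is \(\exp(\theta_4)\) and the rate is shape/mean; the inputs below are arbitrary test values:

```r
# The hand-coded Gamma log-density versus dgamma(); the difference should be ~0.
theta <- c(22, 0, -1, 1); x <- 0.5; y <- 20
mu    <- theta[1] + theta[2] * x + theta[3] * x^2
shape <- exp(theta[4])
hand  <- shape * theta[4] - lgamma(shape) + (shape - 1) * log(y) -
  shape * log(mu) - shape * y / mu
hand - dgamma(y, shape = shape, rate = shape / mu, log = TRUE)
```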

Again we take the initial values from the coefficients of the fit of the polynomial model under normality above.

Code
coefficients(fit_lm_quad)
(Intercept)           x      I(x^2) 
 21.9508813   0.1268222  -1.3951485 
Code
coefficients(fit_lm_cubic)
(Intercept)           x      I(x^2)      I(x^3) 
 21.9478684  -2.5147235  -1.3803982   0.6568337 
Code
coefficients(fit_lm_quart)
(Intercept)           x      I(x^2)      I(x^3)      I(x^4) 
 21.8831786  -2.4994190  -0.4111086   0.6481662  -0.2350041 
Code
fit_q7_gamma_quad<-optim(c(22,0,-1,1),
                     fn = nll_reg_gamma_quad,
                     gr = grad_nll_gamma_quad,
                     x = data_q7$x,
                     y = data_q7$y,
                     hessian = T)

fit_q7_gamma_quad$par
[1] 22.0186932  0.1450894 -1.4188934  0.7253400
Code
fit_q7_gamma_cubic<-optim(c(22,-2,-1,1,1),
                     fn = nll_reg_gamma_cubic,
                     gr = grad_nll_gamma_cubic,
                     x = data_q7$x,
                     y = data_q7$y,
                     hessian = T)

fit_q7_gamma_cubic$par
[1] 22.0633394 -4.5889329 -1.4069488  1.1766088  0.7277329
Code
fit_q7_gamma_quart<-optim(c(22,-2,-1,1,0,1),
                     fn = nll_reg_gamma_quart,
                     gr = grad_nll_gamma_quart,
                     x = data_q7$x,
                     y = data_q7$y,
                     hessian = T)

fit_q7_gamma_quart$par
[1] 21.8997921 -1.5466114  0.1868520  0.4182851 -0.3861897  0.7280344

We again note the similarity between the fitted coefficients of the mean functions and those of the corresponding normal fits.

We now plot the three fits (quadratic, cubic and fourth order) under the Gamma response distribution together with the corresponding fits under the GHS distribution.

Code
n_grid   <- 100
xx       <- seq(-3,3,length=n_grid)
mean_gamma_quad  <- rep(NA,n_grid)
mean_gamma_cubic <- rep(NA,n_grid)
mean_gamma_quart <- rep(NA,n_grid)

ci_gamma_quad   <- matrix(NA,
                       nrow = n_grid,
                       ncol = 2)

ci_gamma_cubic  <- matrix(NA,
                       nrow = n_grid,
                       ncol = 2)

ci_gamma_quart  <- matrix(NA,
                       nrow = n_grid,
                       ncol = 2)


for (i in 1:n_grid){
  vec.x_quad         <- c(1,xx[i],xx[i]^2)  
  vec.x_cubic        <- c(1,xx[i],xx[i]^2,xx[i]^3)  
  vec.x_quart        <- c(1,xx[i],xx[i]^2,xx[i]^3,xx[i]^4)  
  
  est_gamma_quad  <- crossprod(vec.x_quad ,fit_q7_gamma_quad$par[1:3])
  est_gamma_cubic <- crossprod(vec.x_cubic,fit_q7_gamma_cubic$par[1:4])
  est_gamma_quart <- crossprod(vec.x_quart,fit_q7_gamma_quart$par[1:5])
  
  se_gamma_quad   <- sqrt(crossprod(vec.x_quad ,solve(fit_q7_gamma_quad$hessian)[1:3,1:3])%*%vec.x_quad)
  se_gamma_cubic  <- sqrt(crossprod(vec.x_cubic,solve(fit_q7_gamma_cubic$hessian)[1:4,1:4])%*%vec.x_cubic)
  se_gamma_quart  <- sqrt(crossprod(vec.x_quart,solve(fit_q7_gamma_quart$hessian)[1:5,1:5])%*%vec.x_quart)
  
  mean_gamma_quad[i]  <- est_gamma_quad
  mean_gamma_cubic[i] <- est_gamma_cubic
  mean_gamma_quart[i] <- est_gamma_quart
  
  ci_gamma_quad[i,]   <- c(est_gamma_quad -1.96*se_gamma_quad ,est_gamma_quad +1.96*se_gamma_quad)
  ci_gamma_cubic[i,]  <- c(est_gamma_cubic-1.96*se_gamma_cubic,est_gamma_cubic+1.96*se_gamma_cubic)
  ci_gamma_quart[i,]  <- c(est_gamma_quart -1.96*se_gamma_quart ,est_gamma_quart +1.96*se_gamma_quart)
}

GAMMA_fit_quad<- data.frame(x = xx,
                          mean = mean_gamma_quad,
                          lwr = ci_gamma_quad[,1],
                          upr = ci_gamma_quad[,2])

GAMMA_fit_cubic<- data.frame(x = xx,
                          mean = mean_gamma_cubic,
                          lwr = ci_gamma_cubic[,1],
                          upr = ci_gamma_cubic[,2])

GAMMA_fit_quart<- data.frame(x = xx,
                          mean = mean_gamma_quart,
                          lwr = ci_gamma_quart[,1],
                          upr = ci_gamma_quart[,2])


data_q7 %>% 
  ggplot(aes(x,
             y))+
    geom_point() +
      geom_smooth(aes(col = "z_nonparametric"),
                  se = F) +
          geom_line(mapping = aes(x    = x,
                                y = mean,
                              col = "quadratic fit (GHS)"),
                    lty = 2,
                    lwd = 2,
                  data  = GHS_fit_quad) +
      geom_line(mapping = aes(x    = x,
                                y = mean,
                              col = "cubic fit (GHS)"),
                      lty =2,
                      lwd = 2,
                  data  = GHS_fit_cubic) +
      geom_line(mapping = aes(x    = x,
                                y = mean,
                              col = "fourth order fit (GHS)"),
                      lty =2,
                      lwd = 2,
                  data  = GHS_fit_quart) +  
      geom_line(mapping = aes(x    = x,
                                y = mean,
                              col = "quadratic fit (Gamma)"),
                  data  = GAMMA_fit_quad) +
      geom_line(mapping = aes(x    = x,
                                y = mean,
                              col = "cubic fit (Gamma)"),
                  data  = GAMMA_fit_cubic) +
      geom_line(mapping = aes(x    = x,
                                y = mean,
                              col = "fourth order fit (Gamma)"),
                  data  = GAMMA_fit_quart)+
        coord_cartesian(ylim = c(14,25))+
                 labs(col = "Mean function") +
          theme(legend.position = "bottom")

Again, all the parametric fits agree closely where there are data and differ elsewhere, except for the two quadratic fits. In terms of AIC (see below), every fitted Gamma parametric model is much better than the normal model of the same order, but no better than the GHS model of the same order.

Code
AIC_GAMMA_quad  <- 2*length(fit_q7_gamma_quad$par) +2*fit_q7_gamma_quad$value
AIC_GAMMA_cubic <- 2*length(fit_q7_gamma_cubic$par)+2*fit_q7_gamma_cubic$value 
AIC_GAMMA_quart <- 2*length(fit_q7_gamma_quart$par)+2*fit_q7_gamma_quart$value 


AIC_GHS_quad
[1] 2257.687
Code
AIC_GAMMA_quad
[1] 2268.396
Code
AIC(fit_lm_quad)
[1] 2373.666
Code
AIC_GHS_cubic
[1] 2258.8
Code
AIC_GAMMA_cubic
[1] 2269.543
Code
AIC(fit_lm_cubic)
[1] 2375.356
Code
AIC_GHS_quart
[1] 2260.424
Code
AIC_GAMMA_quart
[1] 2271.435
Code
AIC(fit_lm_quart)
[1] 2377.22

Below we can see the best model under each of the three distributions with the corresponding 95% confidence bands; they are clearly almost indistinguishable.

Code
data_q7 %>% 
  ggplot(aes(x,
             y))+
    geom_point() +
            geom_smooth(aes(col = "quadratic fit (normal)"),
                        method = "lm",
                      se = T,
                      formula = y ~ x + I(x^2)) +
        coord_cartesian(ylim = c(10,25),
                        xlim = c(-3,3))+
                 labs(col = "Mean function") +
          theme(legend.position = "bottom")

Code
data_q7 %>% 
  ggplot(aes(x,
             y))+
    geom_point() +
          geom_line(mapping = aes(x    = x,
                                y = mean,
                              col = "quadratic fit (GHS)"),
                    lwd=1.5,
                  data  = GHS_fit_quad) +
               geom_ribbon(mapping = aes(x    = x,
                                y = mean,
                                ymin = lwr,
                                ymax = upr),
                  data  = GHS_fit_quad,
                  alpha = 0.5,
                  fill  = "gray")+
        coord_cartesian(ylim = c(10,25),
                        xlim = c(-3,3))+
                 labs(col = "Mean function") +
          theme(legend.position = "bottom")

Code
data_q7 %>% 
  ggplot(aes(x,
             y))+
    geom_point() +
          geom_line(mapping = aes(x    = x,
                                y = mean,
                              col = "quadratic fit (Gamma)"),
                    lwd=1.5,
                  data  = GAMMA_fit_quad) +
               geom_ribbon(mapping = aes(x    = x,
                                y = mean,
                                ymin = lwr,
                                ymax = upr),
                  data  = GAMMA_fit_quad,
                  alpha = 0.5,
                  fill  = "gray")+
        coord_cartesian(ylim = c(10,25))+
                 labs(col = "Mean function") +
          theme(legend.position = "bottom")

Finally, below we plot the cubic and fourth order GHS fits, since the GHS models were the best among all those tested. The cubic and fourth order models clearly have too much variability once the confidence bands are taken into account, so staying with the quadratic model is the sensible choice in this case. The most contentious region lies outside the data range, where we would be extrapolating; either more data or further assumptions are needed to find a better model there.

Code
data_q7 %>% 
  ggplot(aes(x,
             y))+
    geom_point() +
          geom_line(mapping = aes(x    = x,
                                y = mean,
                              col = "cubic fit (GHS)"),
                    lwd=1.5,
                  data  = GHS_fit_cubic) +
               geom_ribbon(mapping = aes(x    = x,
                                y = mean,
                                ymin = lwr,
                                ymax = upr),
                  data  = GHS_fit_cubic,
                  alpha = 0.5,
                  fill  = "gray")+
        coord_cartesian(ylim = c(10,30))+
                 labs(col = "Mean function") +
          theme(legend.position = "bottom")

Code
data_q7 %>% 
  ggplot(aes(x,
             y))+
    geom_point() +
          geom_line(mapping = aes(x    = x,
                                y = mean,
                              col = "fourth order fit (GHS)"),
                    lwd=1.5,
                  data  = GHS_fit_quart) +
               geom_ribbon(mapping = aes(x    = x,
                                y = mean,
                                ymin = lwr,
                                ymax = upr),
                  data  = GHS_fit_quart,
                  alpha = 0.5,
                  fill  = "gray")+
        coord_cartesian(ylim = c(10,30))+
                 labs(col = "Mean function") +
          theme(legend.position = "bottom")

Question 8

Use the data in Question 7 to compute 95% confidence intervals for the least worse value of the mean function at each \(x\), that is \(\mu(\bm{\theta}^{(1)}_\dagger,x)\) for each of the three parametric families: \(\mathcal F_1\), the Gamma and the Normal. Plot the computed confidence bands in the range \(x\in(-3,3)\) for each parametric family and comment on the differences obtained.

Question 8 Solution

Code
n_sample <- length(data_q7$x)


K_ghs <- matrix(0, nrow = 4, ncol = 4)  # accumulates the outer products of the per-observation scores

for (i in 1:n_sample){

  s <- grad_nll_reg_quad(fit_q7_ghs_quad$par,
                         x = data_q7$x[i],
                         y = data_q7$y[i],
                         N = N_fix)  
  K_ghs <- K_ghs + tcrossprod(s)  
}

K_ghs <- K_ghs/n_sample 

J_ghs <-  fit_q7_ghs_quad$hessian

AV_ghs <- solve(J_ghs) %*% K_ghs %*% solve(J_ghs)


AV_ghs
              [,1]          [,2]          [,3]          [,4]
[1,]  7.076443e-03 -5.945529e-05 -1.773070e-03 -1.586995e-05
[2,] -5.945529e-05  5.351631e-04  5.307525e-06  5.289956e-06
[3,] -1.773070e-03  5.307525e-06  5.781330e-04  3.400070e-06
[4,] -1.586995e-05  5.289956e-06  3.400070e-06  2.431578e-05
Code
solve(J_ghs)
              [,1]          [,2]          [,3]          [,4]
[1,]  2.1886905704 -3.568070e-03 -5.440518e-01 -8.463130e-05
[2,] -0.0035680697  1.525185e-01  4.597482e-03 -9.943164e-06
[3,] -0.5440518101  4.597482e-03  1.731418e-01  2.277209e-05
[4,] -0.0000846313 -9.943164e-06  2.277209e-05  6.936270e-03

We see they are quite similar, which suggests that the data generating distribution \(f_*\) is very likely close to a GHS distribution.

Code
K_gamma <- matrix(0, nrow = 4, ncol = 4)  # accumulates the outer products of the per-observation scores

for (i in 1:n_sample){

  s <- grad_nll_gamma_quad(fit_q7_gamma_quad$par,
                         x = data_q7$x[i],
                         y = data_q7$y[i])  
  K_gamma <- K_gamma + tcrossprod(s)  
}

K_gamma <- K_gamma/n_sample

J_gamma <-  fit_q7_gamma_quad$hessian

AV_gamma <- solve(J_gamma) %*% K_gamma %*% solve(J_gamma)


AV_gamma
              [,1]          [,2]          [,3]          [,4]
[1,]  7.071397e-03 -5.825315e-05 -1.772466e-03 -6.450621e-06
[2,] -5.825315e-05  5.353098e-04  4.809184e-06  1.600373e-06
[3,] -1.772466e-03  4.809184e-06  5.782194e-04  4.637303e-06
[4,] -6.450621e-06  1.600373e-06  4.637303e-06  2.790382e-05
Code
solve(J_gamma)
              [,1]          [,2]          [,3]          [,4]
[1,]  2.326873e+00 -3.608096e-03 -5.784390e-01 -6.200143e-06
[2,] -3.608096e-03  1.609681e-01  4.820358e-03  1.817196e-06
[3,] -5.784390e-01  4.820358e-03  1.838023e-01  1.318940e-06
[4,] -6.200143e-06  1.817196e-06  1.318940e-06  5.772139e-03

We see they are again quite similar, which suggests that the data generating distribution \(f_*\) is also close to a Gamma distribution; the Gamma and GHS fits can therefore be close to each other.
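The sandwich construction used above can be sketched in a self-contained toy example (all names illustrative): a normal linear model with known unit variance, where the per-observation scores of the negative loglikelihood are accumulated just as in the loops above. Here the score outer products are summed without dividing by the sample size, which is one common scaling convention:

```r
# Toy sandwich J^{-1} K J^{-1} for y = b0 + b1*x + e, e ~ N(0,1).
set.seed(1)
n <- 200
x <- rnorm(n); y <- 1 + 2 * x + rnorm(n)
X <- cbind(1, x)
b <- solve(crossprod(X), crossprod(X, y))   # least-squares (= ML) estimate
r <- as.numeric(y - X %*% b)
J <- crossprod(X)                           # Hessian of the negative loglikelihood
K <- matrix(0, nrow = 2, ncol = 2)
for (i in 1:n) {
  s <- -r[i] * X[i, ]                       # per-observation score of the nll
  K <- K + tcrossprod(s)
}
solve(J) %*% K %*% solve(J)                 # robust (sandwich) variance estimate
```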

Code
n_grid   <- 100
xx       <- seq(-3,3,length=n_grid)

mean_ghs_quad    <- rep(NA,n_grid)
mean_gamma_quad  <- rep(NA,n_grid)

ci_ghs_quad   <- matrix(NA,
                       nrow = n_grid,
                       ncol = 2)


ci_gamma_quad   <- matrix(NA,
                       nrow = n_grid,
                       ncol = 2)


for (i in 1:n_grid){

  vec.x_quad         <- c(1,xx[i],xx[i]^2)  

  est_ghs_quad    <- crossprod(vec.x_quad ,fit_q7_ghs_quad$par[1:3])
  est_gamma_quad  <- crossprod(vec.x_quad ,fit_q7_gamma_quad$par[1:3])

  se_ghs_quad    <- sqrt(crossprod(vec.x_quad ,AV_ghs[1:3,1:3])%*%vec.x_quad)
  se_gamma_quad  <- sqrt(crossprod(vec.x_quad ,AV_gamma[1:3,1:3])%*%vec.x_quad)

  mean_ghs_quad[i]    <- est_ghs_quad
  mean_gamma_quad[i]  <- est_gamma_quad
  
  ci_ghs_quad[i,]   <- c(est_ghs_quad -1.96*se_ghs_quad ,est_ghs_quad +1.96*se_ghs_quad)
  ci_gamma_quad[i,]   <- c(est_gamma_quad -1.96*se_gamma_quad ,est_gamma_quad +1.96*se_gamma_quad)
  
}

GHS_fit_quad<- data.frame(x = xx,
                          mean = mean_ghs_quad,
                          lwr = ci_ghs_quad[,1],
                          upr = ci_ghs_quad[,2])


GAMMA_fit_quad<- data.frame(x = xx,
                          mean = mean_gamma_quad,
                          lwr = ci_gamma_quad[,1],
                          upr = ci_gamma_quad[,2])


data_q7 %>% 
  ggplot(aes(x,
             y))+
    geom_point() +
          geom_line(mapping = aes(x    = x,
                                y = mean,
                              col = "quadratic fit (GHS)"),
                    lwd=1,
                  data  = GHS_fit_quad) +
               geom_ribbon(mapping = aes(x    = x,
                                y = mean,
                                ymin = lwr,
                                ymax = upr),
                  data  = GHS_fit_quad,
                  alpha = 0.5,
                  fill  = "gray")+
        coord_cartesian(ylim = c(10,25))+
                 labs(col = "Mean function") +
          theme(legend.position = "bottom")

Code
data_q7 %>% 
  ggplot(aes(x,
             y))+
    geom_point() +
          geom_line(mapping = aes(x    = x,
                                y = mean,
                              col = "quadratic fit (Gamma)"),
                    lwd=1,
                  data  = GAMMA_fit_quad) +
               geom_ribbon(mapping = aes(x    = x,
                                y = mean,
                                ymin = lwr,
                                ymax = upr),
                  data  = GAMMA_fit_quad,
                  alpha = 0.5,
                  fill  = "gray")+
        coord_cartesian(ylim = c(10,25))+
                 labs(col = "Mean function") +
          theme(legend.position = "bottom")